GEM-PRO - Calculating Protein Properties¶
This notebook gives an example of how to calculate protein properties for a list of proteins. The main features demonstrated are:
- Information retrieval from UniProt and linking residue numbering sites to structure
- Calculating or predicting global protein sequence and structure properties
- Calculating or predicting local protein sequence and structure properties
Imports¶
In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Logging¶
Set the logging level in logger.setLevel(logging.<LEVEL_HERE>) to specify how verbose you want the pipeline to be. DEBUG is the most verbose.
- CRITICAL - Only really important messages shown
- ERROR - Major errors
- WARNING - Warnings that don’t affect running of the pipeline
- INFO (default) - Info such as the number of structures mapped per gene
- DEBUG - Really detailed information that will print out a lot of stuff
DEBUG mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO) # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
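To see how the chosen level filters messages, here is a minimal sketch using only the standard library. The logger name `demo_gempro` and the helper `emitted_levels` are made up for illustration and are not part of ssbio.

```python
import io
import logging


def emitted_levels(level):
    """Return the level names of the messages that pass a logger set to `level`."""
    stream = io.StringIO()
    demo_logger = logging.getLogger("demo_gempro")  # hypothetical logger name
    demo_logger.setLevel(level)
    demo_logger.propagate = False  # keep demo output away from the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))
    demo_logger.handlers = [handler]

    # One message per level, from most to least verbose
    demo_logger.debug("really detailed information")
    demo_logger.info("number of structures mapped per gene")
    demo_logger.warning("something that does not stop the pipeline")
    demo_logger.error("a major error")
    demo_logger.critical("a really important message")
    return stream.getvalue().split()


print(emitted_levels(logging.INFO))   # INFO and everything more severe
print(emitted_levels(logging.ERROR))  # only ERROR and CRITICAL
```

Each level lets through its own messages plus all more severe ones, which is why DEBUG produces the most output.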
Initialization¶
Set these three things:
- ROOT_DIR - The directory where a folder named after your PROJECT will be created
- PROJECT - Your project name
- LIST_OF_GENES - Your list of gene IDs
A directory will be created in ROOT_DIR with your PROJECT name.
The folders are organized like so:
ROOT_DIR
└── PROJECT
    ├── data  # General storage for pipeline outputs
    ├── model  # SBML and GEM-PRO models are stored here
    ├── genes  # Per gene information
    │   ├── <gene_id1>  # Specific gene directory
    │   │   └── protein
    │   │       ├── sequences  # Protein sequence files, alignments, etc.
    │   │       └── structures  # Protein structure files, calculations, etc.
    │   └── <gene_id2>
    │       └── protein
    │           ├── sequences
    │           └── structures
    ├── reactions  # Per reaction information
    │   └── <reaction_id1>  # Specific reaction directory
    │       └── complex
    │           └── structures  # Protein complex files
    └── metabolites  # Per metabolite information
        └── <metabolite_id1>  # Specific metabolite directory
            └── chemical
                └── structures  # Metabolite 2D and 3D structure files
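As a rough illustration of the layout above, the sketch below assembles the per-gene sequence and structure folder paths by hand. The helper `gene_protein_dirs` is hypothetical; it only mirrors the tree and is not an ssbio function.

```python
import os.path as op


def gene_protein_dirs(root_dir, project, gene_id):
    """Build the sequence and structure folder paths for one gene (hypothetical helper)."""
    protein_dir = op.join(root_dir, project, "genes", gene_id, "protein")
    return {
        "sequences": op.join(protein_dir, "sequences"),
        "structures": op.join(protein_dir, "structures"),
    }


dirs = gene_protein_dirs("/tmp", "ssbio_protein_properties", "b0118")
print(dirs["sequences"])
print(dirs["structures"])
```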
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()
PROJECT = 'ssbio_protein_properties'
LIST_OF_GENES = ['b1276', 'b0118']
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES, pdb_file_type='pdb')
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: /tmp/ssbio_protein_properties: GEM-PRO project location
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 2: number of genes
Mapping gene ID –> sequence¶
First, we need to map these IDs to their protein sequences. Two ID mapping services are provided for this: KEGG and UniProt. The end goal is to assign a UniProt ID to each gene ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.
GEMPRO.uniprot_mapping_and_metadata(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)
Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.
Parameters:
- model_gene_source (str) – The database source of your model gene IDs. See: http://www.uniprot.org/help/api_idmapping Common model gene sources are:
  - Ensembl Genomes - ENSEMBLGENOME_ID (i.e. E. coli b-numbers)
  - Entrez Gene (GeneID) - P_ENTREZGENEID
  - RefSeq Protein - P_REFSEQ_AC
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
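The shape of a custom_gene_mapping dictionary can be sketched as follows. This is only an illustration of the "keys must match model genes" rule; `apply_custom_mapping` and the replacement ID are hypothetical and do not reflect how ssbio handles the mapping internally.

```python
def apply_custom_mapping(model_genes, custom_gene_mapping=None):
    """Return {model_gene: id_to_look_up}, checking that every key is a model gene."""
    custom_gene_mapping = custom_gene_mapping or {}
    unknown = set(custom_gene_mapping) - set(model_genes)
    if unknown:
        raise KeyError("Keys not found in model genes: {}".format(sorted(unknown)))
    # Genes without a custom entry keep their own ID
    return {g: custom_gene_mapping.get(g, g) for g in model_genes}


model_genes = ["b1276", "b0118"]
# 'alt_id_for_b1276' is a placeholder, not a real database identifier
lookup_ids = apply_custom_mapping(model_genes, {"b1276": "alt_id_for_b1276"})
print(lookup_ids)
```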
In [8]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()
[2018-02-05 16:52] [root] INFO: getUserAgent: Begin
[2018-02-05 16:52] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 16:52] [root] INFO: getUserAgent: End
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 2/2: number of genes mapped to UniProt
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.
Missing UniProt mapping: []
Out[8]:
gene | uniprot | reviewed | gene_name | kegg | refseq | pdbs | pfam | description | entry_date | entry_version | seq_date | seq_version | sequence_file | metadata_file
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
b0118 | P36683 | False | acnB | ecj:JW0114;eco:b0118 | NP_414660.1;WP_001307570.1 | 1L5J | PF00330;PF06434;PF11791 | Aconitate hydratase B | 2018-01-31 | 165 | 1997-11-01 | 3 | P36683.fasta | P36683.xml
b1276 | P25516 | False | acnA | ecj:JW1268;eco:b1276 | NP_415792.1;WP_000099535.1 | NaN | PF00330;PF00694 | Aconitate hydratase A | 2018-01-31 | 153 | 2008-01-15 | 3 | P25516.fasta | P25516.xml
GEMPRO.set_representative_sequence(force_rerun=False)
Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.
Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.
Parameters:
- force_rerun (bool) – Set to True to recheck stored sequences
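The precedence rule described above can be written out as a small decision function. This is a sketch of the stated rule only, under the assumption that each mapping is a plain dict with an optional `pdbs` list; `choose_representative` is a hypothetical name and not the ssbio implementation.

```python
def choose_representative(manual=None, uniprot=None, kegg=None):
    """Pick one sequence record following the precedence described in the text.

    Each argument is either None or a dict with an 'id' key and an optional 'pdbs' list.
    """
    # Manually set sequences override everything else
    if manual is not None:
        return manual
    if uniprot is not None and kegg is not None:
        # KEGG wins only when it has PDBs and UniProt has none
        if kegg.get("pdbs") and not uniprot.get("pdbs"):
            return kegg
        return uniprot
    return uniprot if uniprot is not None else kegg


uniprot_seq = {"id": "P36683", "pdbs": ["1L5J"]}
kegg_seq = {"id": "eco:b0118", "pdbs": []}
print(choose_representative(uniprot=uniprot_seq, kegg=kegg_seq)["id"])
```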
In [9]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative sequence
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.
Missing a representative sequence: []
Out[9]:
gene | uniprot | kegg | pdbs | sequence_file | metadata_file
---|---|---|---|---|---
b0118 | P36683 | ecj:JW0114;eco:b0118 | 1L5J | P36683.fasta | P36683.xml
b1276 | P25516 | ecj:JW1268;eco:b1276 | NaN | P25516.fasta | P25516.xml
Mapping representative sequence –> structure¶
These are the ways to map sequence to structure:
1. Use the UniProt ID and their automatic mappings to the PDB
2. BLAST the sequence to the PDB
3. Make homology models
4. Map to existing homology models
You can only use option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you’ll have to BLAST your sequence to the PDB or make a homology model. You can also run both the UniProt mapping and BLAST for maximum coverage.
GEMPRO.map_uniprot_to_pdb(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)
Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s sequences folder.
The “Best structures” API is available at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html It returns the list of PDB structures mapping to a UniProt accession, sorted by coverage of the protein and, if the same, resolution.
Parameters:
- seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
- outdir (str) – Output directory to cache JSON results of search
- force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns: A rank-ordered list of PDBProp objects that map to the UniProt ID
Return type: list
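The ranking described for the “Best structures” results (coverage first, then resolution) can be sketched with a plain sort. The hit dictionaries and IDs below are invented placeholders, and the sketch assumes that higher coverage and a lower resolution value rank first; it does not call the PDBe API.

```python
def rank_structures(hits):
    """Sort hits by coverage (highest first), then by resolution (lowest value first)."""
    return sorted(hits, key=lambda h: (-h["coverage"], h["resolution"]))


# Placeholder hits, not real PDB entries
hits = [
    {"pdb_id": "pdb_a", "coverage": 0.80, "resolution": 1.5},
    {"pdb_id": "pdb_b", "coverage": 1.00, "resolution": 2.4},
    {"pdb_id": "pdb_c", "coverage": 1.00, "resolution": 1.9},
]
ranked = rank_structures(hits)
for rank, hit in enumerate(ranked, start=1):
    print(rank, hit["pdb_id"])
```

With these placeholder values, full-coverage entries come first and the tie between them is broken by resolution.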
In [10]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2018-02-05 16:52] [root] INFO: getUserAgent: Begin
[2018-02-05 16:52] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 16:52] [root] INFO: getUserAgent: End
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: 1/2: number of genes with at least one experimental structure
[2018-02-05 16:52] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.
Out[10]:
gene | pdb_id | pdb_chain_id | uniprot | experimental_method | resolution | coverage | start | end | unp_start | unp_end | rank
---|---|---|---|---|---|---|---|---|---|---|---
b0118 | 1l5j | A | P36683 | X-ray diffraction | 2.4 | 1 | 1 | 865 | 1 | 865 | 1
b0118 | 1l5j | B | P36683 | X-ray diffraction | 2.4 | 1 | 1 | 865 | 1 | 865 | 2
GEMPRO.blast_seqs_to_pdb(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)
BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).
Parameters:
- seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
- evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
- all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
- display_link (bool, optional) – Set to True if links to the HTML results should be displayed
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
In [11]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.7, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)
[2018-02-05 16:53] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 16:53] [ssbio.pipeline.gempro] INFO: 0: number of genes with additional structures added from BLAST
[2018-02-05 16:53] [ssbio.pipeline.gempro] WARNING: Empty dataframe
Out[11]:
Below, we are mapping to previously generated homology models for E. coli. If you are running this as a tutorial, they won’t exist on your computer, so you can skip these steps.
GEMPRO.get_manual_homology_models(input_dict, outdir=None, clean=True, force_rerun=False)
Copy homology models to the GEM-PRO project.
Requires an input of a dictionary formatted like so:
{
    model_gene: {
        homology_model_id1: {
            'model_file': '/path/to/homology/model.pdb',
            'file_type': 'pdb',
            'additional_info': info_value
        },
        homology_model_id2: {
            'model_file': '/path/to/homology/model.pdb',
            'file_type': 'pdb'
        }
    }
}
Parameters:
- input_dict (dict) – Dictionary of dictionaries of gene names to homology model IDs and other information
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- clean (bool) – If homology files should be cleaned and saved as a new PDB file
- force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
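One way to assemble a dictionary of that shape from a list of records is sketched below. The helper `build_homology_input`, the model IDs, and the file names are hypothetical; only the nested layout follows the description above.

```python
import os.path as op


def build_homology_input(records, models_dir):
    """Build the nested input dict from (model_gene, model_id, file_name) tuples."""
    input_dict = {}
    for gene, model_id, file_name in records:
        # setdefault keeps several models for the same gene instead of overwriting
        input_dict.setdefault(gene, {})[model_id] = {
            "model_file": op.join(models_dir, file_name),
            "file_type": "pdb",
        }
    return input_dict


# Placeholder records: the IDs and file names are made up for illustration
records = [
    ("b1276", "model_1", "b1276_model_1.pdb"),
    ("b1276", "model_2", "b1276_model_2.pdb"),
    ("b0118", "model_1", "b0118_model_1.pdb"),
]
homology_input = build_homology_input(records, "/path/to/homology")
print(sorted(homology_input["b1276"]))
```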
In [15]:
import pandas as pd
import os.path as op
In [16]:
# Creating manual mapping dictionary for ECOLI I-TASSER models
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/zhang/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/zhang_data/160804-ZHANG_INFO.csv')
tmp = homology_models_df[['zhang_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]
homology_model_dict = {}
for i, r in tmp.iterrows():
    homology_model_dict[r['m_gene']] = {r['zhang_id']: {'model_file': op.join(homology_models, r['model_file']),
                                                        'file_type': 'pdb'}}
my_gempro.get_manual_homology_models(homology_model_dict)
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Updated homology model information for 2 genes.
In [17]:
# Creating manual mapping dictionary for ECOLI SUNPRO models
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/sunpro/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/sunpro_data/160609-SUNPRO_INFO.csv')
tmp = homology_models_df[['sunpro_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]
homology_model_dict = {}
for i, r in tmp.iterrows():
    homology_model_dict[r['m_gene']] = {r['sunpro_id']: {'model_file': op.join(homology_models, r['model_file']),
                                                         'file_type': 'pdb'}}
my_gempro.get_manual_homology_models(homology_model_dict)
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Updated homology model information for 2 genes.
Downloading and ranking structures¶
GEMPRO.pdb_downloader_and_metadata(outdir=None, pdb_file_type=None, force_rerun=False)
Download ALL mapped experimental structures to each protein’s structures directory.
Parameters:
- outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
- pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
- force_rerun (bool) – If files should be re-downloaded if they already exist
In [18]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: Saved 1 structures total
Out[18]:
gene | pdb_id | pdb_title | description | experimental_method | mapped_chains | resolution | chemicals | taxonomy_name | structure_file
---|---|---|---|---|---|---|---|---|---
b0118 | 1l5j | CRYSTAL STRUCTURE OF E. COLI ACONITASE B. | Aconitate hydratase 2 (E.C.4.2.1.3) | X-ray diffraction | A;B | 2.4 | F3S;TRA | Escherichia coli | 1l5j.pdb
GEMPRO.set_representative_structure(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine='needle', always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)
Set the representative structure for each protein from the structures in its structures attribute.
Each gene can have a combination of the following, which will be analyzed to set a representative structure.
- Homology model(s)
- Ranked PDBs
- BLASTed PDBs
If the always_use_homology flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.
Parameters:
- seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
- struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
- pdb_file_type (str) – pdb, mmCif, xml, mmtf - file type for files downloaded from the PDB
- engine (str) – biopython or needle - which pairwise alignment program to use. needle is the standard EMBOSS tool to run pairwise alignments. biopython is Biopython’s implementation of needle. Results can differ!
- always_use_homology (bool) – If homology models should always be set as the representative structure
- rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
- seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
- allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
- allow_mutants (bool) – If mutations should be allowed or checked for
- allow_deletions (bool) – If deletions should be allowed or checked for
- allow_insertions (bool) – If insertions should be allowed or checked for
- allow_unresolved (bool) – If unresolved residues should be allowed or checked for
- skip_large_structures (bool) – Default False. Currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to True.
- clean (bool) – If structures should be cleaned
- force_rerun (bool) – If sequence to structure alignment should be rerun
Todo
- Remedy large structure representative setting
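The allow_missing_on_termini example (0.1 on a 100 AA reference leaves residues 5 to 95) can be reproduced with a short calculation. The sketch assumes the ignored fraction is split evenly between the two termini, which matches the example in the parameter description but may not be exactly how ssbio computes it; `checked_residue_range` is a hypothetical helper.

```python
def checked_residue_range(seq_len, allow_missing_on_termini):
    """Return (first, last) residue checked, assuming an even split across both termini."""
    trim = round(seq_len * allow_missing_on_termini / 2)
    return trim, seq_len - trim


first, last = checked_residue_range(100, 0.1)
print(first, last)
```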
In [19]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative structure
[2018-02-05 16:56] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.
Out[19]:
gene | id | is_experimental | file_type | structure_file
---|---|---|---|---
b0118 | REP-1l5j | True | pdb | 1l5j-A_clean.pdb
b1276 | REP-ACON1_ECOLI | False | pdb | ACON1_ECOLI_model1_clean-X_clean.pdb
Computing and storing protein properties¶
GEMPRO.get_sequence_properties(representatives_only=True)
Run Biopython ProteinAnalysis and EMBOSS pepstats to summarize basic statistics of all protein sequences. Results are stored in the protein’s respective SeqProp objects at .annotations
Parameters:
- representatives_only (bool) – If analysis should only be run on the representative sequences
In [20]:
# Requires EMBOSS "pepstats" program
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
# Install using:
# sudo apt-get install emboss
my_gempro.get_sequence_properties()
GEMPRO.get_scratch_predictions(path_to_scratch, results_dir, scratch_basename='scratch', num_cores=1, exposed_buried_cutoff=25, custom_gene_mapping=None)
Run and parse SCRATCH results to predict secondary structure and solvent accessibility. Annotations are stored in the protein’s representative sequence at:
- .annotations
- .letter_annotations
Parameters:
- path_to_scratch (str) – Path to SCRATCH executable
- results_dir (str) – Path to SCRATCH results folder, which will have the files (scratch.ss, scratch.ss8, scratch.acc, scratch.acc20)
- scratch_basename (str) – Basename of the SCRATCH results (‘scratch’ is default)
- num_cores (int) – Number of cores to use to parallelize SCRATCH run
- exposed_buried_cutoff (int) – Cutoff of exposed/buried for the acc20 predictions
- custom_gene_mapping (dict) – Default parsing of SCRATCH output files is to look for the model gene IDs. If your output files contain IDs which differ from the model gene IDs, use this dictionary to map model gene IDs to result file IDs. Dictionary keys must match model genes.
In [ ]:
# Requires SCRATCH installation, replace path_to_scratch with own path to script
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_scratch_predictions(path_to_scratch='scratch',
results_dir=my_gempro.data_dir,
num_cores=4)
GEMPRO.find_disulfide_bridges(representatives_only=True)
Run Biopython’s disulfide bridge finder and store found bridges.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.annotations['SSBOND-biopython']
Parameters:
- representatives_only (bool) – If analysis should only be run on the representative structure
In [22]:
my_gempro.find_disulfide_bridges(representatives_only=False)
GEMPRO.get_dssp_annotations(representatives_only=True, force_rerun=False)
Run DSSP on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-dssp']
Parameters:
- representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
In [23]:
# Requires DSSP installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_dssp_annotations()
GEMPRO.get_msms_annotations(representatives_only=True, force_rerun=False)
Run MSMS on structures and store calculations.
Annotations are stored in the protein structure’s chain sequence at:
<chain_prop>.seq_record.letter_annotations['*-msms']
Parameters:
- representatives_only (bool) – If analysis should only be run on the representative structure
- force_rerun (bool) – If calculations should be rerun even if an output file exists
In [24]:
# Requires MSMS installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_msms_annotations()
Additional annotations¶
Loading feature files to the representative sequence¶
“Features” are currently loaded directly from UniProt, but if another feature file is available for each protein, it can be loaded manually.
In [25]:
# for g in my_gempro.genes_with_a_representative_sequence:
#     g.protein.representative_sequence.feature_path = '/path/to/new/feature/file.gff'
Adding more properties¶
Additional global or local properties can be loaded after loading the saved GEM-PRO.
Make sure to add 'seq_hydrophobicity-kd' to the list of columns to be returned later on!
Example with hydrophobicity¶
In [26]:
# Kyte-Doolittle scale for hydrophobicity
kd = { 'A': 1.8,'R':-4.5,'N':-3.5,'D':-3.5,'C': 2.5,
'Q':-3.5,'E':-3.5,'G':-0.4,'H':-3.2,'I': 4.5,
'L': 3.8,'K':-3.9,'M': 1.9,'F': 2.8,'P':-1.6,
'S':-0.8,'T':-0.7,'W':-0.9,'Y':-1.3,'V': 4.2 }
In [27]:
# Use Biopython to calculate hydrophobicity using a set sliding window length
from Bio.SeqUtils.ProtParam import ProteinAnalysis
window = 7
for g in my_gempro.genes_with_a_representative_sequence:
    # Create a ProteinAnalysis object -- see http://biopython.org/wiki/ProtParam
    my_seq = g.protein.representative_sequence.seq_str
    analysed_seq = ProteinAnalysis(my_seq)
    # Calculate scale
    hydrophobicity = analysed_seq.protein_scale(param_dict=kd, window=window)
    # Correct list length by prepending and appending "inf" (result needs to be same length as sequence)
    for i in range(window//2):
        hydrophobicity.insert(0, float("Inf"))
        hydrophobicity.append(float("Inf"))
    # Add new annotation to the representative sequence's "letter_annotations" dictionary
    g.protein.representative_sequence.letter_annotations['hydrophobicity-kd'] = hydrophobicity
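The padding step is needed because a sliding window of length w over a sequence of length n yields only n - w + 1 values, while per-residue annotations must have one value per residue. The sketch below shows the arithmetic with a plain unweighted window mean in pure Python; it is a stand-in for illustration and not Biopython's protein_scale. The helper name, the toy sequence, and the reduced scale dictionary are assumptions.

```python
# Kyte-Doolittle values for the residues used in this toy example only
KD_SUBSET = {"A": 1.8, "R": -4.5, "I": 4.5, "L": 3.8, "K": -3.9, "G": -0.4}


def padded_window_mean(sequence, scale, window):
    """Mean scale value per window, padded with inf so len(result) == len(sequence)."""
    if window % 2 == 0:
        raise ValueError("window must be odd so the padding is symmetric")
    if len(sequence) < window:
        raise ValueError("sequence is shorter than the window")
    half = window // 2
    means = [
        sum(scale[aa] for aa in sequence[i:i + window]) / window
        for i in range(len(sequence) - window + 1)
    ]
    # window // 2 placeholders on each side restore the original length
    pad = [float("inf")] * half
    return pad + means + pad


toy_seq = "ARILKGAIL"  # made-up sequence
profile = padded_window_mean(toy_seq, KD_SUBSET, window=7)
print(len(profile) == len(toy_seq))
```

The inf placeholders mark edge residues that have no complete window, so they are easy to mask out later.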
Global protein properties¶
Properties of the entire protein sequence/structure are stored in:
- The representative_sequence annotations field
- The representative_structure’s representative chain SeqRecord
These properties describe aspects of the entire protein, such as its molecular weight, the percentage of amino acids in a particular secondary structure, etc.
In [28]:
# Printing all global protein properties
from pprint import pprint
# Only looking at 2 genes for now, remove [:2] to gather properties for all
for g in my_gempro.genes_with_a_representative_sequence[:2]:
    repseq = g.protein.representative_sequence
    repstruct = g.protein.representative_structure
    repchain = g.protein.representative_chain
    print('Gene: {}'.format(g.id))
    print('Number of structures: {}'.format(g.protein.num_structures))
    print('Representative sequence: {}'.format(repseq.id))
    print('Representative structure: {}'.format(repstruct.id))
    print('----------------------------------------------------------------')
    print('Global properties of the representative sequence:')
    pprint(repseq.annotations)
    print('----------------------------------------------------------------')
    print('Global properties of the representative structure:')
    pprint(repstruct.chains.get_by_id(repchain).seq_record.annotations)
    print('****************************************************************')
    print('****************************************************************')
    print('****************************************************************')
Gene: b1276
Number of structures: 3
Representative sequence: P25516
Representative structure: REP-ACON1_ECOLI
----------------------------------------------------------------
Global properties of the representative sequence:
{'amino_acids_percent-biop': {'A': 0.08641975308641975,
'C': 0.007856341189674524,
'D': 0.06397306397306397,
'E': 0.06172839506172839,
'F': 0.025813692480359147,
'G': 0.08754208754208755,
'H': 0.020202020202020204,
'I': 0.04826038159371493,
'K': 0.04826038159371493,
'L': 0.09427609427609428,
'M': 0.028058361391694726,
'N': 0.037037037037037035,
'P': 0.05611672278338945,
'Q': 0.030303030303030304,
'R': 0.05723905723905724,
'S': 0.05723905723905724,
'T': 0.06060606060606061,
'V': 0.0819304152637486,
'W': 0.014590347923681257,
'Y': 0.03254769921436588},
'aromaticity-biop': 0.07295173961840629,
'instability_index-biop': 36.28239057239071,
'isoelectric_point-biop': 5.59344482421875,
'molecular_weight-biop': 97676.06830000057,
'monoisotopic-biop': False,
'percent_acidic-pepstats': 0.1257,
'percent_aliphatic-pepstats': 0.31089,
'percent_aromatic-pepstats': 0.09315,
'percent_basic-pepstats': 0.1257,
'percent_charged-pepstats': 0.2514,
'percent_helix_naive-biop': 0.29741863075196406,
'percent_non-polar-pepstats': 0.56341,
'percent_polar-pepstats': 0.43659,
'percent_small-pepstats': 0.53872,
'percent_strand_naive-biop': 0.27048260381593714,
'percent_tiny-pepstats': 0.29966000000000004,
'percent_turn_naive-biop': 0.2379349046015713}
----------------------------------------------------------------
Global properties of the representative structure:
{'percent_B-dssp': 0.010101010101010102,
'percent_C-dssp': 0.2222222222222222,
'percent_E-dssp': 0.1739618406285073,
'percent_G-dssp': 0.03928170594837262,
'percent_H-dssp': 0.345679012345679,
'percent_I-dssp': 0.005611672278338945,
'percent_S-dssp': 0.09427609427609428,
'percent_T-dssp': 0.10886644219977554}
****************************************************************
****************************************************************
****************************************************************
Gene: b0118
Number of structures: 4
Representative sequence: P36683
Representative structure: REP-1l5j
----------------------------------------------------------------
Global properties of the representative sequence:
{'amino_acids_percent-biop': {'A': 0.11213872832369942,
'C': 0.011560693641618497,
'D': 0.06358381502890173,
'E': 0.06589595375722543,
'F': 0.03352601156069364,
'G': 0.08786127167630058,
'H': 0.017341040462427744,
'I': 0.05433526011560694,
'K': 0.056647398843930635,
'L': 0.10173410404624278,
'M': 0.026589595375722544,
'N': 0.035838150289017344,
'P': 0.06242774566473988,
'Q': 0.028901734104046242,
'R': 0.04508670520231214,
'S': 0.04161849710982659,
'T': 0.05433526011560694,
'V': 0.06705202312138728,
'W': 0.009248554913294798,
'Y': 0.024277456647398842},
'aromaticity-biop': 0.06705202312138728,
'instability_index-biop': 32.79631213872841,
'isoelectric_point-biop': 5.23931884765625,
'molecular_weight-biop': 93497.01500000065,
'monoisotopic-biop': False,
'percent_acidic-pepstats': 0.12948,
'percent_aliphatic-pepstats': 0.33526000000000006,
'percent_aromatic-pepstats': 0.08439,
'percent_basic-pepstats': 0.11907999999999999,
'percent_charged-pepstats': 0.24855,
'percent_helix_naive-biop': 0.29017341040462424,
'percent_non-polar-pepstats': 0.59075,
'percent_polar-pepstats': 0.40924999999999995,
'percent_small-pepstats': 0.53642,
'percent_strand_naive-biop': 0.3063583815028902,
'percent_tiny-pepstats': 0.30751,
'percent_turn_naive-biop': 0.22774566473988442}
----------------------------------------------------------------
Global properties of the representative structure:
{'percent_B-dssp': 0.016241299303944315,
'percent_C-dssp': 0.20765661252900233,
'percent_E-dssp': 0.14037122969837587,
'percent_G-dssp': 0.03480278422273782,
'percent_H-dssp': 0.3805104408352668,
'percent_I-dssp': 0.0,
'percent_S-dssp': 0.08236658932714618,
'percent_T-dssp': 0.13805104408352667}
****************************************************************
****************************************************************
****************************************************************
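Note that the "percent" values in the output above are fractions between 0 and 1 rather than values out of 100. A composition dictionary of the same form as 'amino_acids_percent-biop' can be sketched in pure Python as below; `residue_fractions` and the toy sequence are hypothetical and stand in for the Biopython calculation.

```python
from collections import Counter


def residue_fractions(sequence):
    """Return {residue: fraction of the sequence}, with fractions summing to 1."""
    counts = Counter(sequence)
    total = len(sequence)
    return {aa: counts[aa] / total for aa in sorted(counts)}


toy_fractions = residue_fractions("MKALAA")  # made-up sequence
print(toy_fractions)
```

The same counting approach applies to any per-residue string, for example a string of secondary structure codes.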
Local protein properties¶
Properties of specific residues are stored in:
- The representative_sequence’s letter_annotations attribute
- The representative_structure’s representative chain SeqRecord
Specific sites, like metal or metabolite binding sites, can be found in the representative_sequence’s features attribute. This information is retrieved from UniProt. The examples below extract features for the metal binding sites.
The properties related to those sites can be retrieved using the function get_residue_annotations.
UniProt contains more information than just “sites”.
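When reading feature locations, keep in mind that Biopython's FeatureLocation uses zero-based, half-open coordinates, so a single-residue site stored as (434, 435) refers to residue 435 in the usual 1-based numbering. A minimal sketch of the conversion, using plain integers instead of Biopython objects (the helper name is hypothetical):

```python
def to_residue_numbers(start, end):
    """Convert a zero-based, half-open (start, end) span to 1-based residue numbers."""
    return list(range(start + 1, end + 1))


print(to_residue_numbers(434, 435))  # a single-residue site
print(to_residue_numbers(243, 246))  # a three-residue region
```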
In [29]:
# Looking at all features
for g in my_gempro.genes_with_a_representative_sequence[:2]:
    g.id
    # UniProt features
    [x for x in g.protein.representative_sequence.features]
    # Catalytic site atlas features
    for s in g.protein.structures:
        if s.structure_file:
            for c in s.mapped_chains:
                if s.chains.get_by_id(c).seq_record:
                    if s.chains.get_by_id(c).seq_record.features:
                        [x for x in s.chains.get_by_id(c).seq_record.features]
Out[29]:
'b1276'
Out[29]:
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1)), type='initiator methionine'),
SeqFeature(FeatureLocation(ExactPosition(1), ExactPosition(891)), type='chain', id='PRO_0000076661'),
SeqFeature(FeatureLocation(ExactPosition(434), ExactPosition(435)), type='metal ion-binding site'),
SeqFeature(FeatureLocation(ExactPosition(500), ExactPosition(501)), type='metal ion-binding site'),
SeqFeature(FeatureLocation(ExactPosition(503), ExactPosition(504)), type='metal ion-binding site'),
SeqFeature(FeatureLocation(ExactPosition(521), ExactPosition(522)), type='sequence conflict')]
Out[29]:
'b0118'
Out[29]:
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(865)), type='chain', id='PRO_0000076675'),
SeqFeature(FeatureLocation(ExactPosition(243), ExactPosition(246)), type='region of interest'),
SeqFeature(FeatureLocation(ExactPosition(413), ExactPosition(416)), type='region of interest'),
SeqFeature(FeatureLocation(ExactPosition(709), ExactPosition(710)), type='metal ion-binding site'),
SeqFeature(FeatureLocation(ExactPosition(768), ExactPosition(769)), type='metal ion-binding site'),
SeqFeature(FeatureLocation(ExactPosition(771), ExactPosition(772)), type='metal ion-binding site'),
SeqFeature(FeatureLocation(ExactPosition(190), ExactPosition(191)), type='binding site'),
SeqFeature(FeatureLocation(ExactPosition(497), ExactPosition(498)), type='binding site'),
SeqFeature(FeatureLocation(ExactPosition(790), ExactPosition(791)), type='binding site'),
SeqFeature(FeatureLocation(ExactPosition(795), ExactPosition(796)), type='binding site'),
SeqFeature(FeatureLocation(ExactPosition(768), ExactPosition(769)), type='mutagenesis site'),
SeqFeature(FeatureLocation(ExactPosition(1), ExactPosition(14)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(23), ExactPosition(35)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(41), ExactPosition(51)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(58), ExactPosition(72)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(82), ExactPosition(90)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(98), ExactPosition(104)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(104), ExactPosition(107)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(108), ExactPosition(111)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(111), ExactPosition(120)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(127), ExactPosition(137)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(140), ExactPosition(151)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(153), ExactPosition(157)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(165), ExactPosition(178)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(178), ExactPosition(182)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(184), ExactPosition(190)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(193), ExactPosition(197)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(197), ExactPosition(200)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(213), ExactPosition(216)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(219), ExactPosition(227)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(232), ExactPosition(243)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(247), ExactPosition(257)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(257), ExactPosition(261)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(263), ExactPosition(269)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(271), ExactPosition(278)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(279), ExactPosition(288)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(291), ExactPosition(295)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(305), ExactPosition(310)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(310), ExactPosition(314)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(314), ExactPosition(318)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(318), ExactPosition(321)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(323), ExactPosition(327)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(333), ExactPosition(341)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(343), ExactPosition(360)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(383), ExactPosition(391)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(391), ExactPosition(394)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(408), ExactPosition(413)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(414), ExactPosition(417)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(417), ExactPosition(427)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(437), ExactPosition(440)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(443), ExactPosition(448)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(450), ExactPosition(465)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(465), ExactPosition(468)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(478), ExactPosition(483)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(483), ExactPosition(486)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(491), ExactPosition(497)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(502), ExactPosition(507)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(511), ExactPosition(521)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(530), ExactPosition(538)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(545), ExactPosition(559)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(572), ExactPosition(576)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(576), ExactPosition(583)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(588), ExactPosition(597)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(597), ExactPosition(600)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(600), ExactPosition(603)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(604), ExactPosition(609)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(612), ExactPosition(632)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(637), ExactPosition(653)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(666), ExactPosition(673)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(673), ExactPosition(676)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(680), ExactPosition(683)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(690), ExactPosition(693)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(693), ExactPosition(696)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(696), ExactPosition(699)), type='turn'),
SeqFeature(FeatureLocation(ExactPosition(703), ExactPosition(707)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(713), ExactPosition(726)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(731), ExactPosition(737)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(741), ExactPosition(750)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(752), ExactPosition(760)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(769), ExactPosition(772)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(774), ExactPosition(777)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(783), ExactPosition(790)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(795), ExactPosition(798)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(801), ExactPosition(805)), type='strand'),
SeqFeature(FeatureLocation(ExactPosition(807), ExactPosition(817)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(822), ExactPosition(834)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(836), ExactPosition(840)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(845), ExactPosition(848)), type='helix'),
SeqFeature(FeatureLocation(ExactPosition(849), ExactPosition(857)), type='helix')]
Protein.get_residue_annotations(seq_resnum, seqprop=None, structprop=None, chain_id=None, use_representatives=False)
Get all residue-level annotations stored in the SeqProp letter_annotations field for a given residue number.
Uses the representative sequence, structure, and chain ID stored by default. If other properties from other structures are desired, input the proper IDs. An alignment for the given sequence to the structure must be present in the sequence_alignments list.
Parameters:
- seq_resnum (int) – Residue number in the sequence
- seqprop (SeqProp) – SeqProp object
- structprop (StructProp) – StructProp object
- chain_id (str) – ID of the structure’s chain to get annotation from
- use_representatives (bool) – If the representative sequence/structure/chain IDs should be used
Returns: All available letter_annotations for this residue number
Return type: dict
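The feature locations printed above follow Biopython's convention of 0-based, half-open coordinates, while seq_resnum is a 1-based residue number. For a single-residue feature such as FeatureLocation(434, 435), the end position therefore equals the residue number (435). The sketch below illustrates the conversion with plain integers; residue_numbers is a hypothetical helper written for illustration, not part of ssbio or Biopython.

```python
def residue_numbers(start, end):
    """Return the 1-based residue numbers covered by a 0-based, half-open span [start, end)."""
    return list(range(start + 1, end + 1))


# Single-residue feature, e.g. the metal ion-binding site at FeatureLocation(434, 435)
print(residue_numbers(434, 435))  # [435]

# Multi-residue feature, e.g. the region of interest at FeatureLocation(243, 246)
print(residue_numbers(243, 246))  # [244, 245, 246]
```

For multi-residue features, passing only the end position would annotate just the last residue, so looping over every residue number in the span may be preferable.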
In [30]:
metal_info = []
for g in my_gempro.genes:
    for f in g.protein.representative_sequence.features:
        if 'metal' in f.type.lower():
            res_info = g.protein.get_residue_annotations(f.location.end, use_representatives=True)
            res_info['gene_id'] = g.id
            res_info['seq_id'] = g.protein.representative_sequence.id
            res_info['struct_id'] = g.protein.representative_structure.id
            res_info['chain_id'] = g.protein.representative_chain
            metal_info.append(res_info)
cols = ['gene_id', 'seq_id', 'struct_id', 'chain_id',
        'seq_residue', 'seq_resnum', 'struct_residue', 'struct_resnum',
        'seq_SS-sspro', 'seq_SS-sspro8', 'seq_RSA-accpro', 'seq_RSA-accpro20',
        'struct_SS-dssp', 'struct_RSA-dssp', 'struct_ASA-dssp',
        'struct_PHI-dssp', 'struct_PSI-dssp', 'struct_CA_DEPTH-msms', 'struct_RES_DEPTH-msms']
pd.DataFrame.from_records(metal_info, columns=cols).set_index(['gene_id', 'seq_id', 'struct_id', 'chain_id', 'seq_resnum'])
Out[30]:
gene_id | seq_id | struct_id | chain_id | seq_resnum | seq_residue | struct_residue | struct_resnum | seq_SS-sspro | seq_SS-sspro8 | seq_RSA-accpro | seq_RSA-accpro20 | struct_SS-dssp | struct_RSA-dssp | struct_ASA-dssp | struct_PHI-dssp | struct_PSI-dssp | struct_CA_DEPTH-msms | struct_RES_DEPTH-msms |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
b1276 | P25516 | REP-ACON1_ECOLI | X | 435 | C | C | 435 | NaN | NaN | NaN | NaN | H | 0.059259 | 8.0 | -61.1 | -26.6 | 2.656722 | 2.813536 |
b1276 | P25516 | REP-ACON1_ECOLI | X | 501 | C | C | 501 | NaN | NaN | NaN | NaN | S | 0.088889 | 12.0 | -61.0 | -50.0 | 1.999713 | 2.409119 |
b1276 | P25516 | REP-ACON1_ECOLI | X | 504 | C | C | 504 | NaN | NaN | NaN | NaN | G | 0.259259 | 35.0 | -56.0 | -45.6 | 1.999634 | 1.961484 |
b0118 | P36683 | REP-1l5j | A | 710 | C | C | 710 | NaN | NaN | NaN | NaN | T | 0.118519 | 16.0 | -67.1 | -7.2 | 10.148960 | 10.009109 |
b0118 | P36683 | REP-1l5j | A | 769 | C | C | 769 | NaN | NaN | NaN | NaN | - | 0.088889 | 12.0 | -67.8 | -28.3 | 8.296585 | 8.049832 |
b0118 | P36683 | REP-1l5j | A | 772 | C | C | 772 | NaN | NaN | NaN | NaN | G | 0.081481 | 11.0 | -50.2 | -38.0 | 8.282292 | 8.239369 |
Column definitions¶
- gene_id: Gene ID used in GEM-PRO project
- seq_id: Representative protein sequence ID
- struct_id: Representative protein structure ID, with REP- prepended to it. 4 letter structure IDs are experimental structures from the PDB, others are homology models
- chain_id: Representative chain ID in the representative structure
- seq_resnum: Residue number of the amino acid in the representative sequence
- site_name: Name of the feature as defined in UniProt
- seq_residue: Amino acid in the representative sequence at the residue number
- struct_residue: Amino acid in the representative structure at the residue number
- struct_resnum: Residue number of the amino acid in the representative structure
- seq_SS-sspro: Predicted secondary structure, 3 definitions (from the SCRATCH program)
- seq_SS-sspro8: Predicted secondary structure, 8 definitions (SCRATCH)
- seq_RSA-accpro: Predicted exposed (e) or buried (-) residue (SCRATCH)
- seq_RSA-accpro20: Predicted exposed/buried, 0 to 100 scale (SCRATCH)
- struct_SS-dssp: Secondary structure (DSSP program)
- struct_RSA-dssp: Relative solvent accessibility (DSSP)
- struct_ASA-dssp: Solvent accessibility, absolute value (DSSP)
- struct_PHI-dssp: Phi angle measure (DSSP)
- struct_PSI-dssp: Psi angle measure (DSSP)
- struct_RES_DEPTH-msms: Calculated residue depth averaged for all atoms in the residue (MSMS program)
- struct_CA_DEPTH-msms: Calculated residue depth for the carbon alpha atom (MSMS)
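As a worked check of how struct_RSA-dssp relates to struct_ASA-dssp: relative accessibility is the absolute accessibility divided by a per-residue maximum. The cysteine rows in the table above are consistent with a maximum of 135 Å², which is an assumption inferred from the values shown (the scale actually applied depends on how DSSP results were post-processed). A minimal sketch:

```python
# Assumed maximum accessible surface area for cysteine, in square angstroms.
# This value is inferred from the table above, not taken from the ssbio source.
MAX_ASA_CYS = 135.0


def relative_accessibility(asa, max_asa=MAX_ASA_CYS):
    """Relative solvent accessibility = absolute accessibility / residue maximum."""
    return asa / max_asa


# struct_ASA-dssp values of the six cysteines listed in the table
for asa in (8.0, 12.0, 35.0, 16.0, 12.0, 11.0):
    print(asa, round(relative_accessibility(asa), 6))
```

The rounded ratios can be compared against the struct_RSA-dssp column row by row.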
Visualizing residues¶
StructProp.view_structure(only_chains=None, opacity=1.0, recolor=False, gui=False)
Use NGLviewer to display a structure in a Jupyter notebook.
Parameters:
- only_chains (str, list) – Chain ID or IDs to display
- opacity (float) – Opacity of the structure
- recolor (bool) – If structure should be cleaned and recolored to silver
- gui (bool) – If the NGLview GUI should show up
Returns: NGLviewer object
StructProp.add_residues_highlight_to_nglview(view, structure_resnums, chain=None, res_color='red')
Add a residue number or numbers to an NGLWidget view object.
Parameters:
- view (NGLWidget) – NGLWidget view object
- structure_resnums (int, list) – Residue number(s) to highlight, structure numbering
- chain (str, list) – Chain ID or IDs that the residues belong to. If not provided, all chains in the mapped_chains attribute will be used. If that is also empty, an exception is raised.
- res_color (str) – Color to highlight residues with
In [31]:
for g in my_gempro.genes:
    # Gather residue numbers
    metal_binding_structure_residues = []
    for f in g.protein.representative_sequence.features:
        if 'metal' in f.type.lower():
            res_info = g.protein.get_residue_annotations(f.location.end, use_representatives=True)
            metal_binding_structure_residues.append(res_info['struct_resnum'])
    print(metal_binding_structure_residues)
    # Display structure
    view = g.protein.representative_structure.view_structure()
    g.protein.representative_structure.add_residues_highlight_to_nglview(view=view, structure_resnums=metal_binding_structure_residues)
    view
[435, 501, 504]
[2018-02-05 17:00] [ssbio.protein.structure.structprop] INFO: Selection: ( :X ) and not hydrogen and ( 504 or 435 or 501 )
A Jupyter Widget
[710, 769, 772]
[2018-02-05 17:00] [ssbio.protein.structure.structprop] INFO: Selection: ( :A ) and not hydrogen and ( 769 or 772 or 710 )
A Jupyter Widget
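The INFO lines above show the selection string passed to the viewer: a chain clause, a hydrogen filter, and the residue numbers joined with "or". The sketch below shows one way such a string could be assembled. build_selection is a hypothetical helper written for illustration and is not ssbio's implementation; unlike the logged output, it sorts the residue numbers so the result is deterministic.

```python
def build_selection(chains, resnums):
    """Assemble an NGL-style selection string for residues on the given chains."""
    # Accept a single chain ID or residue number as well as lists
    if isinstance(chains, str):
        chains = [chains]
    if isinstance(resnums, int):
        resnums = [resnums]
    if not chains:
        raise ValueError("At least one chain ID is required")
    chain_clause = " or ".join(":{}".format(c) for c in chains)
    # De-duplicate and sort the residue numbers
    res_clause = " or ".join(str(r) for r in sorted(set(resnums)))
    return "( {} ) and not hydrogen and ( {} )".format(chain_clause, res_clause)


print(build_selection("X", [435, 501, 504]))
# ( :X ) and not hydrogen and ( 435 or 501 or 504 )
```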
Comparing features in different structures of the same protein¶
In [32]:
# Run all sequence to structure alignments
for g in my_gempro.genes:
    for s in g.protein.structures:
        g.protein.align_seqprop_to_structprop(seqprop=g.protein.representative_sequence, structprop=s)
In [33]:
metal_info_compared = []
for g in my_gempro.genes:
    for f in g.protein.representative_sequence.features:
        if 'metal' in f.type.lower():
            for s in g.protein.structures:
                for c in s.mapped_chains:
                    res_info = g.protein.get_residue_annotations(seq_resnum=f.location.end,
                                                                 seqprop=g.protein.representative_sequence,
                                                                 structprop=s, chain_id=c,
                                                                 use_representatives=False)
                    res_info['gene_id'] = g.id
                    res_info['seq_id'] = g.protein.representative_sequence.id
                    res_info['struct_id'] = s.id
                    res_info['chain_id'] = c
                    metal_info_compared.append(res_info)
cols = ['gene_id', 'seq_id', 'struct_id', 'chain_id',
        'seq_residue', 'seq_resnum', 'struct_residue', 'struct_resnum',
        'seq_SS-sspro', 'seq_SS-sspro8', 'seq_RSA-accpro', 'seq_RSA-accpro20',
        'struct_SS-dssp', 'struct_RSA-dssp', 'struct_ASA-dssp',
        'struct_PHI-dssp', 'struct_PSI-dssp', 'struct_CA_DEPTH-msms', 'struct_RES_DEPTH-msms']
pd.DataFrame.from_records(metal_info_compared, columns=cols).sort_values(by=['seq_resnum','struct_id','chain_id']).set_index(['gene_id','seq_id','seq_resnum','seq_residue','struct_id'])
Out[33]:
gene_id | seq_id | seq_resnum | seq_residue | struct_id | chain_id | struct_residue | struct_resnum | seq_SS-sspro | seq_SS-sspro8 | seq_RSA-accpro | seq_RSA-accpro20 | struct_SS-dssp | struct_RSA-dssp | struct_ASA-dssp | struct_PHI-dssp | struct_PSI-dssp | struct_CA_DEPTH-msms | struct_RES_DEPTH-msms |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
b1276 | P25516 | 435 | C | ACON1_ECOLI | X | C | 435 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b1276 | P25516 | 435 | C | E01201 | X | C | 435 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b1276 | P25516 | 435 | C | REP-ACON1_ECOLI | X | C | 435 | NaN | NaN | NaN | NaN | H | 0.059259 | 8.0 | -61.1 | -26.6 | 2.656722 | 2.813536 |
b1276 | P25516 | 501 | C | ACON1_ECOLI | X | C | 501 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b1276 | P25516 | 501 | C | E01201 | X | C | 501 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b1276 | P25516 | 501 | C | REP-ACON1_ECOLI | X | C | 501 | NaN | NaN | NaN | NaN | S | 0.088889 | 12.0 | -61.0 | -50.0 | 1.999713 | 2.409119 |
b1276 | P25516 | 504 | C | ACON1_ECOLI | X | C | 504 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b1276 | P25516 | 504 | C | E01201 | X | C | 504 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b1276 | P25516 | 504 | C | REP-ACON1_ECOLI | X | C | 504 | NaN | NaN | NaN | NaN | G | 0.259259 | 35.0 | -56.0 | -45.6 | 1.999634 | 1.961484 |
b0118 | P36683 | 710 | C | 1l5j | A | C | 710 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b0118 | P36683 | 710 | C | 1l5j | B | C | 710 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b0118 | P36683 | 710 | C | ACON2_ECOLI | X | C | 710 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b0118 | P36683 | 710 | C | E00113 | X | C | 710 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b0118 | P36683 | 710 | C | REP-1l5j | A | C | 710 | NaN | NaN | NaN | NaN | T | 0.118519 | 16.0 | -67.1 | -7.2 | 10.148960 | 10.009109 |
b0118 | P36683 | 769 | C | 1l5j | A | C | 769 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b0118 | P36683 | 769 | C | 1l5j | B | C | 769 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b0118 | P36683 | 769 | C | ACON2_ECOLI | X | C | 769 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b0118 | P36683 | 769 | C | E00113 | X | C | 769 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b0118 | P36683 | 769 | C | REP-1l5j | A | C | 769 | NaN | NaN | NaN | NaN | - | 0.088889 | 12.0 | -67.8 | -28.3 | 8.296585 | 8.049832 |
b0118 | P36683 | 772 | C | 1l5j | A | C | 772 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b0118 | P36683 | 772 | C | 1l5j | B | C | 772 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b0118 | P36683 | 772 | C | ACON2_ECOLI | X | C | 772 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b0118 | P36683 | 772 | C | E00113 | X | C | 772 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
b0118 | P36683 | 772 | C | REP-1l5j | A | C | 772 | NaN | NaN | NaN | NaN | G | 0.081481 | 11.0 | -50.2 | -38.0 | 8.282292 | 8.239369 |
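In the table above, only the rows for the representative (REP-) structures carry DSSP and MSMS values; the other entries are NaN. The sketch below shows one way such records could be filtered down to the rows with structure-derived annotations. It uses a shortened, hand-copied subset of the rows, and has_structure_annotation is a hypothetical helper, not an ssbio function.

```python
import math


def has_structure_annotation(record, key="struct_SS-dssp"):
    """Return True if the record holds a non-missing value for the given annotation key."""
    value = record.get(key)
    if value is None:
        return False
    # pandas represents missing values as float NaN
    return not (isinstance(value, float) and math.isnan(value))


# Hand-copied subset of the rows shown above (residue 710 of b0118)
records = [
    {"struct_id": "1l5j", "chain_id": "A", "seq_resnum": 710, "struct_SS-dssp": float("nan")},
    {"struct_id": "E00113", "chain_id": "X", "seq_resnum": 710, "struct_SS-dssp": float("nan")},
    {"struct_id": "REP-1l5j", "chain_id": "A", "seq_resnum": 710, "struct_SS-dssp": "T"},
]

annotated = [r["struct_id"] for r in records if has_structure_annotation(r)]
print(annotated)  # ['REP-1l5j']
```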